Based on the Data Carpentry curriculum for Data Visualization in Python ( © Data Carpentry under Creative Commons Attribution license )
We will be using the publicly available data on emergency department visits for asthma in California, from the California Health and Human Services Open Data Portal.
The downloaded filename is asthma-ed-visit-rates-lghc-indicator-07.csv, but we have saved a version of the file here as asthma.csv (in the data directory).
First, we need to create a Pandas dataframe of our file:
In [ ]:
import pandas as pd
In [ ]:
df_asthma = pd.read_csv("data/asthma.csv")
In [ ]:
df_asthma.head()
This is a rather complicated table, as it stacks a number of different demographic categories in the same column (Strata). We can appreciate this better by looking at all the unique values in some of our columns.
In [ ]:
df_asthma['Age Group'].unique()
In [ ]:
df_asthma['Strata'].unique()
In [ ]:
df_asthma['Strata Name'].unique()
Thus, we will probably want to work with subsets of our dataframe for analysis and visualization. For this example, we will first create a subset that only includes the rows where Strata Name is All Ages:
In [ ]:
df_subset = df_asthma[df_asthma['Strata Name']=='All Ages']
df_subset.head()
In [ ]:
df_asthma.size
In [ ]:
df_subset.size
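The two .size calls above report the total number of cells (rows times columns), which is why the subset gives a smaller number. As a minimal sketch on a made-up table (every value below is invented for illustration and does not come from the asthma file), the same boolean-mask filter and the difference between .size and .shape might look like this:

```python
import pandas as pd

# Toy long-format table; every value here is invented for illustration.
toy = pd.DataFrame({
    "Geography": ["A", "A", "B", "B"],
    "Strata Name": ["All Ages", "Under 18", "All Ages", "Under 18"],
    "Rate": [40.0, 65.0, 52.0, 80.0],
})

# The comparison builds a boolean mask; indexing with it keeps the True rows.
mask = toy["Strata Name"] == "All Ages"
toy_subset = toy[mask]

print(toy.shape, toy.size)                # (rows, columns) and rows * columns
print(toy_subset.shape, toy_subset.size)
```

If you want the number of rows rather than the number of cells, .shape[0] or len() is usually the more direct check.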
Second, we will create a pivot table, to arrange the ED visit rate by year and geographical location:
In [ ]:
df_pivot = df_subset.pivot( values='Rate',
columns='Year',
index='Geography'
)
df_pivot.head(10)
Here is a useful post about reshaping data in pandas (pivot tables, stacking and unstacking columns): http://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/
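To see what the pivot does without the full dataset, here is a small sketch on a toy table (the column names mirror ours, but the values are invented for illustration). Each unique Geography becomes a row, each unique Year becomes a column, and the cells are filled from Rate:

```python
import pandas as pd

# Toy long-format data; the numbers are invented for illustration.
toy_long = pd.DataFrame({
    "Geography": ["A", "A", "B", "B"],
    "Year": [2012, 2013, 2012, 2013],
    "Rate": [40.0, 42.0, 52.0, 50.0],
})

# One row per Geography, one column per Year, cells filled from Rate.
toy_wide = toy_long.pivot(index="Geography", columns="Year", values="Rate")
print(toy_wide)
```

Note that pivot() raises an error if the same index/column pair appears more than once, which is one reason to filter down to a single strata before pivoting.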
Matplotlib is a Python library that can be used to visualize data. The toolbox matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. In most cases, this is all that you will need to use, but there are many other useful tools in matplotlib that you should explore.
We will cover a few basic commands for formatting plots in this lesson. A great resource for help styling your figures is the matplotlib gallery (http://matplotlib.org/gallery.html), which includes plots in many different styles and the source code that creates them. The simplest of plots is the two-dimensional line plot. These examples walk through the basic commands for making line plots using pyplot.
In [ ]:
import matplotlib.pyplot as plt
By default, matplotlib will create the figure in a separate window. When using ipython notebooks, we can make figures appear in-line within the notebook by writing:
In [ ]:
%matplotlib inline
We can start by plotting the values of a list of numbers (matplotlib can handle many types of numeric data, including numpy arrays and pandas DataFrames - we are just using a list as an example!):
In [ ]:
list_numbers = [1.5, 4, 2.2, 5.7]
plt.plot(list_numbers)
plt.show()
The command plt.show() prompts Python to display the figure. Without it, matplotlib creates an object in memory but doesn't produce a visible plot. Notebooks that use %matplotlib inline will automatically show you the figure even if you don't write plt.show(), but get in the habit of including this command!
If you provide the plot() function with only one list of numbers, it assumes that it is a sequence of y-values and plots them against their index (the first value in the list is plotted at x=0, the second at x=1, etc.). If the function plot() receives two lists, it assumes the first one is the x-values and the second the y-values. The line connecting the points will follow the list in order:
In [ ]:
plt.plot([6.8, 4.3, 3.2, 8.1], list_numbers)
plt.show()
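One way to check this behaviour is to ask the line objects that plot() returns for the x-values they actually hold. This is only a sketch; it uses plt.subplots() and the Axes method ax.plot() instead of plt.plot() so that the lines are easy to keep hold of, and the variable names are our own:

```python
import matplotlib.pyplot as plt

values = [1.5, 4, 2.2, 5.7]

fig, ax = plt.subplots()
(line_default,) = ax.plot(values)                         # x defaults to 0, 1, 2, 3
(line_explicit,) = ax.plot([6.8, 4.3, 3.2, 8.1], values)  # x given explicitly

print(list(line_default.get_xdata()))
print(list(line_explicit.get_xdata()))
```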
A third, optional argument in plot() is a string of characters that indicates the line type and color for the plot. The default value is a continuous blue line. For example, we can make the line red ('r'), with circles at every data point ('o'), and a dot-dash pattern ('-.'). Look through the matplotlib gallery for more examples.
In [ ]:
plt.plot([6.8, 4.3, 3.2, 8.1], list_numbers, 'ro-.')
plt.axis([0,10,0,6])
plt.show()
The command plt.axis() sets the limits of the axes from a list of [xmin, xmax, ymin, ymax] values (the square brackets are needed because the argument for the function axis() is one list of values, not four separate numbers!). The functions xlabel() and ylabel() will label the axes, and title() will write a title above the figure.
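As a rough sketch of how the format string and the axis limits end up on the plot, the example below reads both back from the objects matplotlib creates. It uses the Axes methods set_xlabel(), set_ylabel() and set_title(), which are the counterparts of plt.xlabel(), plt.ylabel() and plt.title(); the data points are arbitrary:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# 'g^:' = green ('g'), triangle markers ('^'), dotted line (':')
(line,) = ax.plot([1, 2, 3], [2, 4, 1], "g^:")

ax.axis([0, 10, 0, 6])          # one list: [xmin, xmax, ymin, ymax]
ax.set_xlabel("x values")
ax.set_ylabel("y values")
ax.set_title("Format string demo")

print(line.get_marker(), line.get_linestyle())
print(ax.get_xlim(), ax.get_ylim())
```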
A single figure can include multiple lines, and they can be plotted using the same plt.plot() command by adding more pairs of x values and y values (and optionally line styles):
In [ ]:
import numpy as np
# Create a numpy array between 0 and 10, with values evenly spaced every 0.5
t = np.arange(0., 10., 0.5)
# Red dashes with no symbols, blue squares with a solid line, and green triangles with a dotted line
plt.plot(t, t, 'r--', t, t**2, 'bs-', t, t**3, 'g^:')
plt.xlabel('This is the x axis')
plt.ylabel('This is the y axis')
plt.title('This is the figure title')
plt.show()
We can include a legend by adding the optional keyword argument label='' in plot(). Caution: We cannot add labels to multiple lines that are plotted simultaneously by the plt.plot() command like we did above because Python won't know to which line to assign the value of the argument label. Multiple lines can also be plotted in the same figure by calling the plot() function several times:
In [ ]:
# Red dashes with no symbols, blue squares with a solid line, and green triangles with a dotted line
plt.plot(t, t, 'r--', label='linear')
plt.plot(t, t**2, 'bs-', label='square')
plt.plot(t, t**3, 'g^:', label='cubic')
plt.legend(loc='upper left', shadow=True, fontsize='x-large')
plt.xlabel('This is the x axis')
plt.ylabel('This is the y axis')
plt.title('This is the figure title')
plt.show()
The function legend() adds a legend to the figure, and the optional keyword arguments change its style. By default (typing just plt.legend()), the legend has no shadow. Matplotlib 2.0 and later choose its position automatically (loc='best'); older versions placed it in the upper right corner.
The functions xlabel, ylabel, title, legend, and many others create text labels. It is good to know that, in addition to plain text, you may use mathematical notation written in a subset of the LaTeX language. See this link for more information.
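A minimal sketch of LaTeX-style labels, assuming the same t array as before: text between dollar signs is rendered as mathematics, and raw strings (r'...') keep the backslashes intact. The particular curves and label text are only examples:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(0., 10., 0.5)

fig, ax = plt.subplots()
# Raw strings (r'...') keep the backslashes that LaTeX commands need.
ax.plot(t, t**2, "bs-", label=r"$t^2$")
ax.plot(t, np.sqrt(t), "g^:", label=r"$\sqrt{t}$")
ax.set_xlabel(r"time $t$")
legend = ax.legend(loc="upper left")

print([text.get_text() for text in legend.get_texts()])
```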
Like MATLAB, pyplot is stateful; it keeps track of the current figure and plotting area, and any plotting functions are directed to those axes. To make more than one figure, we use the command plt.figure() with an increasing figure number inside the parentheses:
In [ ]:
# This is the first figure
plt.figure(1)
plt.plot(t, t, 'r--', label='linear')
plt.legend(loc='upper left', shadow=True, fontsize='x-large')
plt.title('This is figure 1')
plt.show()
# This is a second figure
plt.figure(2)
plt.plot(t, t**2, 'bs-', label='square')
plt.legend(loc='upper left', shadow=True, fontsize='x-large')
plt.title('This is figure 2')
plt.show()
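Because pyplot is stateful, calling plt.figure() with an existing number switches back to that figure instead of creating a new one. The sketch below illustrates this with plt.get_fignums(), plt.gcf() and plt.gca(), which report the open figure numbers, the current figure and the current axes:

```python
import matplotlib.pyplot as plt

plt.close("all")          # start from a clean state

plt.figure(1)
plt.plot([0, 1, 2], [0, 1, 2], "r--")
plt.figure(2)
plt.plot([0, 1, 2], [0, 1, 4], "bs-")

plt.figure(1)             # make figure 1 current again
plt.title("Back on figure 1")

print(plt.get_fignums())
print(plt.gcf().number)
print(plt.gca().get_title())
```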
A single figure can also include multiple plots in a grid pattern.
When you look at your graph (e.g. the ones above), it is important to realize that Matplotlib does not consider the graph to be the figure. The figure is the part around your graph. Your chart sits on top of the figure. So what’s your visualization?
Your graph is what’s called a subplot or axis. Or, technically, an AxesSubplot.
When we think about axes, we usually think about them as the lines along the left and bottom edges of our graphs. When plotting in Python using Pandas/Matplotlib, you must not think in this way or you’re going to get confused! In this case, an axis - or an ax, as it is typically called - is an entire graph.
So, in the following example, we will create a figure, then add subplots to this single figure. The add_subplot() command specifies the number of rows, the number of columns, and the number of the space in the grid that particular subplot is occupying:
In [ ]:
fig = plt.figure()
fig.add_subplot(2,2,1) # Two rows, two columns, position 1
plt.plot(t, t, 'r--', label='linear')
fig.add_subplot(2,2,2) # Two rows, two columns, position 2
plt.plot(t, t**2, 'bs-', label='square')
fig.add_subplot(2,2,3) # Two rows, two columns, position 3
plt.plot(t, t**3, 'g^:', label='cubic')
plt.show()
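An alternative to calling add_subplot() repeatedly is plt.subplots(), which creates the figure and the whole grid of axes in one call. This is a sketch of the same 2x2 layout written that way, not a replacement for the approach above; hiding the unused fourth slot is our own choice:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(0., 10., 0.5)

# Create the figure and the whole 2x2 grid in one call.
fig, axes = plt.subplots(nrows=2, ncols=2)

# axes is a 2-D numpy array of Axes; index it as axes[row, col].
axes[0, 0].plot(t, t, "r--")       # same slot as add_subplot(2,2,1)
axes[0, 1].plot(t, t**2, "bs-")    # same slot as add_subplot(2,2,2)
axes[1, 0].plot(t, t**3, "g^:")    # same slot as add_subplot(2,2,3)
axes[1, 1].set_visible(False)      # hide the unused fourth slot
fig.tight_layout()                 # reduce overlap between tick labels
```

Plotting through each ax directly avoids relying on which subplot pyplot currently considers active.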
Matplotlib can make many other types of plots in much the same way that it makes two-dimensional line plots. Look through the examples in http://matplotlib.org/users/screenshots.html and try a few of them (click on the "Source code" link and copy and paste into a new cell in an IPython notebook, or save as a text file with a .py extension and run it from the command line).
Challenge - Final Plot
Display your data using one or more plot types from the example gallery. Which ones to choose will depend on the content of your own data file. If you are using the streamgage file, you could make a histogram of the number of days with a given mean discharge, use bar plots to display daily discharge statistics, or explore the different ways matplotlib can handle dates and times for figures.
If you have a figure that you would like to save externally from the notebook, this can be achieved using the plt.savefig() command. At its most basic, call this function after producing your figure, but before calling plt.show(). In the parentheses, enter the full filename and extension you would like to save the image to. Matplotlib will look at the extension, and then save the image as the appropriate filetype. It can handle .png, .jpg, and .pdf, among others. Note that you will not get a warning if the file already exists - it will just overwrite the existing file with the new one, so make sure you check first! (or write some code to check...)
In [ ]:
import os
myfilename = "myfile.png" # change this to whatever you want the file to be called. Include any directories in the name too (eg. "results/myimage.jpg")
if os.path.exists(myfilename):
    print("File already exists!")
else:
    plt.savefig(myfilename)
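The same check can be sketched with pathlib and with the figure object's own savefig() method, which saves that specific figure rather than whichever one is current. The temporary directory below is only a stand-in for your own results folder, and dpi and bbox_inches are optional arguments:

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1.5, 4, 2.2, 5.7])

# A temporary directory stands in for your results folder.
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "myfile.png"

if out_path.exists():
    print("File already exists!")
else:
    # dpi sets the resolution; bbox_inches='tight' trims surrounding whitespace.
    fig.savefig(out_path, dpi=150, bbox_inches="tight")

print(out_path.exists(), out_path.stat().st_size > 0)
```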
The official Pandas plotting documentation helps show the extent of the types of plots and basic presentation options available.
The exercises here are based on the notes by Jonathan Soma, for the Algorithms course for the Lede Program at Columbia University Graduate School of Journalism.
First, let's refresh ourselves on what our dataframe looks like:
In [ ]:
df_asthma.head()
In [ ]:
df_asthma.plot()
plt.show()
As you can see, this is not a particularly useful plot in our case! It has plotted irrelevant data (LGHC Indicator ID), plotted all columns on the same scale, and the x-axis places all values sequentially as they appear in the table (including the year!).
In many cases (including ours), the index will not provide any meaningful relation to the data contained in that row. A meaningful index is most likely to exist in the following two cases: after a .value_counts() or .groupby() operation, or after calling .set_index(), probably with dates.
So except in the above cases, we will usually want to specify one column as the independent variable for the x-axis, and a second variable as the dependent variable for the y-axis. To do this, we specify these variables using the x and y parameters.
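For contrast, here is a sketch of the first case, where the index is meaningful. On a toy table (values invented for illustration), groupby() moves Year into the index, so a bare .plot() already puts the years on the x-axis:

```python
import pandas as pd

# Toy data; the rates are invented for illustration.
toy = pd.DataFrame({
    "Year": [2012, 2012, 2013, 2013, 2014, 2014],
    "Rate": [40.0, 52.0, 42.0, 50.0, 45.0, 49.0],
})

# groupby() moves Year into the index, so .plot() can use it as the x-axis.
mean_rate = toy.groupby("Year")["Rate"].mean()
ax = mean_rate.plot()

print(mean_rate)
```

When no such index exists, as with df_asthma, the x and y parameters do the same job explicitly.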
In [ ]:
df_asthma.plot(x='Year', y='Rate')
plt.axis([2010,2017,0,1000])
plt.show()
When pandas plots, it assumes every single data point should be connected. Naturally, a line graph is a poor choice to represent this kind of data. We can specify the type of chart to generate using the kind parameter.
In [ ]:
df_asthma.plot(x='Year', y='Rate', kind='scatter')
plt.axis([2010,2017,0,1000])
plt.show()
This is better, but with a high density of points, a box plot would be a more helpful choice:
In [ ]:
df_asthma.plot(x='Year', y='Rate', kind='box')
plt.show()
This didn't work in this case, because the boxplot function works by drawing a single box for each column specified in the dataframe. In order to create the expected plot, we therefore have to use the pivoted table we created at the beginning:
In [ ]:
df_pivot.head()
In [ ]:
df_pivot.plot(kind='box')
plt.show()
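The one-box-per-column rule is easy to confirm on a small wide table. In this sketch (values invented for illustration), a table with three year columns produces three boxes, labelled with the column names:

```python
import pandas as pd

# Toy wide table: rows are places, columns are years (values invented).
toy_wide = pd.DataFrame(
    {2012: [40.0, 52.0, 61.0], 2013: [42.0, 50.0, 58.0], 2014: [45.0, 49.0, 66.0]},
    index=["A", "B", "C"],
)

ax = toy_wide.plot(kind="box")  # one box per column
labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(labels)
```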
You can then modify the style of the graph by supplying a (very large) number of possible parameters:
In [ ]:
df_pivot.plot(kind='box',
color=dict(boxes='#BDA493', whiskers='g', medians='red', caps='red'),
boxprops=dict(linestyle='-',linewidth=2.0, ),
whiskerprops=dict(linestyle='-.',linewidth=1.0,)
)
plt.show()
This graph is uglier than the default, but you can see how the values can be set for each of the different elements using a dictionary of values. If you want to set the same value for all elements, you can just specify that value once: color="green".
Colors can be specified by single letters ('b', 'g', 'r', 'c', 'm', 'y', 'k', 'w'),
or by HTML color names ('red', 'orange', 'steelblue', etc.),
or by hex color codes ('#BDA493', '#63C638', etc. A very handy color picker can be found here).
Challenge
Have a go at making a chart that is actually pleasant to look at!
We can also use Matplotlib's ability to plot multiple graphs onto a single figure. Using the same construction as in the first section, we can assign subplots (using the add_subplot() command) to an axis.
When we pass ax=ax to our plot, we’re saying “hey, we already have a graph made up! Please just use it instead” and then pandas/matplotlib does, instead of using a brand-new image for each.
So what’s the difference between a figure and an axis/subplot? Figures are made up of subplots. Figures are the table that the dinner plates of subplots go on. So far we’ve only seen them made up of just one, but we can do more!
Let’s create a 2x1 grid and put something in the first subplot and something in the second subplot. In the example below, you will see we also use the shortcut 2,1,1 and 2,1,2 in the add_subplot function. Remember this is the same as saying nrows=2, ncols=1, index=1. You can be even more terse and enter it as a three-digit number: 211.
In [ ]:
fig = plt.figure()
ax1 = fig.add_subplot(2,1,1)
df_pivot.plot(kind='box', ax=ax1)
ax2 = fig.add_subplot(2,1,2)
df_pivot.plot(ax=ax2, legend=None)
plt.show()
In [ ]:
# place your answer here:
If you want a graph to occupy multiple positions on your figure, you will need to expand the index parameter to indicate the top left and bottom right positions. We will also increase the figure size, since our graphs are starting to get quite small:
In [ ]:
fig = plt.figure(figsize=(10,6)) # figsize is a tuple indicating width x height (in inches)
ax1 = fig.add_subplot(3,3,(2,6))
df_pivot.plot(kind='box', ax=ax1)
ax2 = fig.add_subplot(3,3,(7,9))
df_pivot.plot(ax=ax2, legend=None)
plt.show()
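To make the (first, last) index pairs less mysterious, the sketch below creates the same two spanning subplots without any data and asks each one which grid rows and columns it covers. get_subplotspec() reports these as zero-based ranges:

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 6))

# Positions in a 3x3 grid are numbered row by row:
#   1 2 3
#   4 5 6
#   7 8 9
ax_top = fig.add_subplot(3, 3, (2, 6))     # from position 2 to position 6
ax_bottom = fig.add_subplot(3, 3, (7, 9))  # the whole bottom row

# get_subplotspec() reports the zero-based grid rows and columns covered.
for name, ax in [("top", ax_top), ("bottom", ax_bottom)]:
    spec = ax.get_subplotspec()
    print(name, list(spec.rowspan), list(spec.colspan))
```

So (2,6) covers the first two rows of the middle and right columns, leaving the left column of those rows empty.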
Seaborn improves on the default styling and the ease of use compared with Matplotlib. The main thing to note here is that Seaborn is not natively integrated with Pandas, so you can't get its functionality just by using the .plot() function. Instead, you will need to import Seaborn separately, then feed it your Pandas dataframe. As you will soon see, this small extra step is worth the trouble.
The Seaborn website has a lovely gallery of plots, to show you the scope of what it can do. The documentation for each plot type can be found here.
In [ ]:
import seaborn as sns
In [ ]:
ax = sns.boxplot(data=df_pivot)
Seaborn is a little smarter when it comes to dividing data in a column out into multiple categories. We can therefore produce the above graph without having to first generate the pivot table:
In [ ]:
ax = sns.boxplot(data=df_subset,
x='Year', y="Rate"
)
It is also possible to further break down a seaborn boxplot using the 'hue' parameter (to demonstrate this, we'll create another subset that is the total population strata, which includes three different age group breakdowns):
In [ ]:
df_subset2 = df_asthma[df_asthma['Strata']=='Total Population']
ax = sns.boxplot(data=df_subset2,
x='Year', y='Rate', hue='Age Group',
)
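The key point is that Seaborn works directly on long-format data: x, y and hue simply name columns. A minimal sketch on a toy table (the column names mirror ours, but the values and age group labels are invented for illustration) shows that each distinct value in the hue column gets its own legend entry:

```python
import pandas as pd
import seaborn as sns

# Toy long-format data (values invented): one row per place, year and age group.
toy = pd.DataFrame({
    "Year": [2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013],
    "Age Group": ["Under 18", "18 and Over"] * 4,
    "Rate": [65.0, 38.0, 80.0, 47.0, 62.0, 40.0, 77.0, 45.0],
})

ax = sns.boxplot(data=toy, x="Year", y="Rate", hue="Age Group")
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```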
And by mapping onto a pre-existing figure as in the previous section, we can also set the size to be more appropriate:
In [ ]:
fig, ax = plt.subplots(figsize=(13,4))
ax = sns.boxplot(data=df_subset2,
x='Year', y='Rate', hue='Age Group',
ax=ax,
)
Getting better. To really get a feel for the underlying data, Seaborn can generate violin plots:
In [ ]:
fig, ax = plt.subplots(figsize=(13,4))
ax = sns.violinplot(data=df_subset2,
x='Year', y='Rate', hue='Age Group',
ax=ax,
)
It is also easy to produce nice looking histograms:
In [ ]:
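# distplot() is deprecated since seaborn 0.11; histplot() with kde=True is its replacement
sns.histplot(df_subset['Rate'].dropna(), kde=True, stat='density')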
plt.show()
Or if you wanted a scatterplot (we will need another subset of our data for a useful scatterplot):
In [ ]:
# create a new pivot table showing the rates per county according to age group:
df_pivot2 = df_subset2[df_subset2['Year']==2012].pivot(index='Geography', values='Rate', columns='Strata Name')
# view the first 5 rows:
df_pivot2.head()
In [ ]:
sns.lmplot(data=df_pivot2,
x='18 and Over', y='Under 18',
)
plt.show()
Now for a particularly complex version, we will add the data for 2013 and 2014:
In [ ]:
# create a new pivot table showing the rates per county according to age group:
df_pivot3 = df_subset2[df_subset2['Year']==2013].pivot(index='Geography', values='Rate', columns='Strata Name')
df_pivot4 = df_subset2[df_subset2['Year']==2014].pivot(index='Geography', values='Rate', columns='Strata Name')
# add a column with the year:
df_pivot2['Year']=2012
df_pivot3['Year']=2013
df_pivot4['Year']=2014
# stack the tables vertically
df_concat = pd.concat([df_pivot2, df_pivot3, df_pivot4], axis=0)
# view first 5 rows:
df_concat.head()
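The stacking step can be sketched on two tiny wide tables (values invented for illustration). This version uses assign(), which returns a copy with the extra Year column instead of modifying the pivot tables in place; that is a stylistic alternative to the cell above, not a correction of it:

```python
import pandas as pd

# Two toy wide tables, one per year (values invented).
wide_2012 = pd.DataFrame(
    {"18 and Over": [38.0, 47.0], "Under 18": [65.0, 80.0]}, index=["A", "B"]
)
wide_2013 = pd.DataFrame(
    {"18 and Over": [40.0, 45.0], "Under 18": [62.0, 77.0]}, index=["A", "B"]
)

# assign() returns a copy with the extra column, leaving the inputs untouched.
stacked = pd.concat(
    [wide_2012.assign(Year=2012), wide_2013.assign(Year=2013)],
    axis=0,  # stack rows on top of each other
)
print(stacked)
```

Each place now appears once per year in the index, and the Year column is what hue and col use to tell the rows apart.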
In [ ]:
sns.lmplot(data=df_concat,
x='18 and Over', y='Under 18',
hue='Year', # we want to display each year separately
markers=['x', '+', '*'], # change how each year's marker is displayed
ci=68, # change the confidence interval for the shading
fit_reg=True, # change to False if you want to remove the regression
col='Year', # split the graph into subgraphs (can also use 'row')
)
plt.show()
And if you also wanted the distribution of each variable plotted:
In [ ]:
sns.jointplot(data=df_pivot2, x='18 and Over', y='Under 18')
plt.show()